ED 126 103                                              TM 005 342

AUTHOR        Reichman, Susan L.; Oosterhof, Albert C.
TITLE         Strategy Guidelines for the Construction of Mastery
              Tests.
PUB DATE      [Apr 76]
NOTE          26p.; Paper presented at the Annual Meeting of the
              American Educational Research Association (60th, San
              Francisco, California, April 19-23, 1976)

EDRS PRICE    MF-$0.83 HC-$2.06 Plus Postage.
DESCRIPTORS   *Criterion Referenced Tests; Decision Making;
              *Guidelines; High School Students; Instructional
              Design; *Models; [illegible]; Probability;
              Statistical Analysis; *Student Grouping; Student
              Testing; *Test Construction; Time
IDENTIFIERS   *Mastery Tests

ABSTRACT

Various procedures and guidelines have been suggested for the
development and construction of criterion-referenced tests. The
present paper proposes a comprehensive model which allows the user to
identify and relate specific components which affect the optimal
construction and implementation strategies of criterion-referenced
tests. Furthermore, it establishes parameter values which will allow
the classification of individuals into mastery or nonmastery states
with prespecified levels of confidence. Although the discussed model
incorporates binomial expansions, it uses parameters of selected items
for establishing baseline probabilities instead of true scores derived
from an assumed population of items. (Author)






Strategy Guidelines for the Construction of Mastery Tests

Susan L. Reichman and Albert C. Oosterhof
Florida State University



Various procedures have been suggested for the development and
construction of criterion-referenced tests. The present investigation
proposes a model which allows the user to identify and relate specific
factors which affect the optimal construction and implementation
strategies of criterion-referenced tests. The model incorporates
empirically derived data to establish situational parameters for the
proposed model and then uses this data to illustrate the model's
adaptability to an applied situation.

Application of the proposed model has implications both to the areas
of instructional design and mastery testing strategies. Major
implications to the instructional designer include a means whereby time
can be meaningfully appropriated within a course for instruction and
student assessment. The model provides the instructional designer with
a procedure for determining a feasible number of pass/no-pass decision
points to include within a course of instruction on the basis of values
established for the individual components within the model. Finally,
components which have previously been considered independently, without
attention to interrelationships, are incorporated in such a manner
within the model that when one component is altered the user becomes
aware of the resulting implications to changes in the remainder of the
model.



A paper presented at the Annual Meeting of the American Educational
Research Association, San Francisco, April, 1976.

Although the discussed model incorporates binomial expansions, it
uses parameters of selected items for establishing baseline
probabilities instead of true scores derived from an assumed population
of items. Furthermore, whereas parameters associated with a
heterogeneous group of subjects were incorporated into the current
investigation, procedures are described which allow application of the
proposed model to different groups of students who vary in the degree
to which they cluster around a criterion score.

The following components are included in the model:

1. Amount of student time or average response latency associated
   with a specific item format [λ].

2. Total amount of student time to be allocated to testing [T].

3. The number of items selected for determining pass/no-pass
   decisions on each decision point [k].

4. The number of mastery-status decisions made within a
   specific interval of instruction [n].

5. Probability for individuals performing at a specified
   domain score of being placed in the correct mastery state,
   given the minimum passing percentage score and the average
   difficulty of items selected for making the pass/no-pass
   decision on a particular decision point [symbol illegible].

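The relationship between the number of items assigned to a decision
point, the total amount of time allocated to testing, the average
response latency per item, and the number of mastery-status decisions
made within a specified interval of instruction is defined as

$$k = \frac{T}{n\lambda}$$

where k is the number of items per decision point, T is the total
testing time, n is the number of mastery-status decisions, and λ is
the average response latency per item.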
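The relationship above can be sketched in a few lines of code. This is only an illustration, assuming the formula reads k = T / (n x latency), that time is measured in seconds, and that a fractional item is rounded down. The function and parameter names are hypothetical and do not come from the paper.

```python
import math


def items_per_decision_point(total_seconds: float,
                             n_decisions: int,
                             latency_seconds: float) -> int:
    """Return k = T / (n * latency), rounded down to a whole item.

    Names and units (seconds) are assumptions made for illustration.
    """
    if total_seconds < 0 or n_decisions <= 0 or latency_seconds <= 0:
        raise ValueError("time must be non-negative; n and latency positive")
    # The small epsilon guards against float results such as 7.9999999.
    return math.floor(total_seconds / (n_decisions * latency_seconds) + 1e-9)


# Eight minutes for a single decision point at one minute per item.
print(items_per_decision_point(8 * 60, 1, 60))
```

Rounding down is a design choice for the sketch: a partly administered item cannot contribute to a pass/no-pass decision.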



The probability of classifying an individual into the correct mastery
state is determined by selecting the appropriate terms from the binomial
expansion.

Amount of student time associated with a specific item format

The average response latency time required of the student to respond
to various types of item formats has an inverse relationship to the
number of items that can be presented to the student within an allocated
time. Across the range of various format options (such as simple factual
multiple choice, true/false, complex multiple choice as requiring
problem solving, or complex mathematics and reading selections)
estimates of response latency time range from 30 to 300 seconds per
item. Within a specified amount of time a student can respond to a
finite number of items. One should consider this practical limitation
when designing criterion-referenced tests.

If the instructor or instructional designer fails to take into
consideration response latency of the particular type of item selected
various problems could arise. Take for example a situation where fifteen
minutes were allocated to testing a decision point and twenty items
were required to obtain the desired probabilities of misclassification;
the problem: items selected were of the complex multiple choice type,
requiring more than 50 seconds each. Had the designer taken into account
the type of item and its respective response latency time various
decisions could have been altered: (1) more time could have been
allocated for testing, keeping the probabilities and number of items the
same, (2) the amount of time and number of items could have been held
constant, reducing the probabilities of misclassification, or (3) the
amount of time could have been held constant and items with a shorter
response latency time used. By considering the response latency time of
selected items along with other components in the model, the most useful
combination of components can be selected.
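The arithmetic behind this example can be checked with a short sketch. It treats the stated "more than 50 seconds" as exactly 50 seconds, so the figures are lower bounds; the helper names are hypothetical.

```python
def seconds_needed(n_items: int, latency_seconds: float) -> float:
    """Testing time required for n_items at a given latency per item."""
    return n_items * latency_seconds


def max_items(allocated_seconds: float, latency_seconds: float) -> int:
    """Largest whole number of items that fits the allocated time."""
    return int(allocated_seconds // latency_seconds)


def max_latency(allocated_seconds: float, n_items: int) -> float:
    """Longest average latency that still lets n_items fit the time."""
    return allocated_seconds / n_items


allocated = 15 * 60                    # fifteen minutes, in seconds
print(seconds_needed(20, 50))          # time the twenty items require
print(max_items(allocated, 50))        # items that fit at 50 s each
print(max_latency(allocated, 20))      # latency needed to keep 20 items
```

Under these assumptions the twenty items need at least 1000 seconds against 900 allocated, so the designer must add time, use fewer items, or choose a format that averages 45 seconds or less per item.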

Total amount of student time allocated to testing



The amount of student time the instructor is willing to allocate to
testing directly affects the number and type of decision points that can
realistically be incorporated within any given course of instruction.
Various authors such as Thorndike and Hagen (1959) and Novick and Lewis
(1974) suggest that the percent of time that the instructor is willing
to allocate to testing is a practical factor which limits test length.
Given an upper limit to the total amount of time available for
instruction and assessment, a finite number of tests consisting of a
specified number and type of items can be given to a student. As an
increased number of decision points are incorporated into a set course
of instruction, a student must be certified as having attained mastery
over a test for each respective decision point. As more tests are
incorporated into the course, constraints must be made with regard to
other components in the model; altering the total amount of student time
allocated for testing has an observable effect upon other components.

The number of items selected for determining mastery status

The number of items used for determining a student's mastery status
within an individual domain can be calculated on the basis of values for
the total amount of student time to be allocated to testing, the number
of mastery status decisions made within a specified interval of
instruction, and the average response latency associated with the
specified item format. Holding two of the four variables constant, the
instructional designer can alter the third and compute the required
value for the fourth component. In this manner values for these four
variables can be manipulated and altered until a satisfactory
combination is reached.



Number of mastery-status decisions within a specified interval

The number of decision points contained within a course of
instruction is inversely related to the number of items theoretically
or practically associated with each decision point. Holding the type of
item constant, as the number of items relating to each decision point is
increased more time must be set aside for assessment. Similarly, if the
time allocated for testing is held constant for a given course and the
number of items for each decision point is increased, the number of
feasible decision points decreases.

The total number of decision points which it is feasible to include
within a course of instruction is further affected by the actual type of
decision point. Hambleton and Novick (1973) indicate that when total
testing time is fixed and there is interest in measuring many
competencies, the problem arises as to whether one should obtain very
precise information about a small number of competencies or less precise
information about many more competencies. Cronbach and Gleser (1965)
identified this relationship as the bandwidth-fidelity dilemma;
bandwidth referring to the variety of information obtained from testing
and fidelity as the thoroughness of testing to obtain more complete
information. Depending upon where on the bandwidth-fidelity continuum
the instructional designer elects to test, varying numbers of items will
be required; thus affecting the amount of time to be allocated to
testing and the total type and number of decision points that it is
practical to assess.

Probability of being placed in the correct mastery state

The probability of a student being placed in the correct mastery
state is dependent upon the minimum passing percentage score and the
average difficulty of selected items for individuals performing at a
specified true score. The desired probability of being placed in the
correct mastery state is selected on the basis of the implications
associated with making a false-positive or false-negative action with
respective decision points.

Altering the actual minimum passing level or criterion level can be
used as a means to make student classifications more or less definitive
(Gagne & Briggs, 1974). However, this approach also has its trade-offs.
As the criterion level is moved upward toward 100% correct a greater
number of mastery people may be classified into the nonmastery group;
false-negative actions are more prevalent. As the criterion level is
moved downward an increased number of nonmastery people will be
classified into the mastery group, resulting in more false-positive
actions.
The average difficulty of selected items for individuals performing
at a specified domain score is the best estimate of the probability that
one given item from a domain will classify an individual into an
inappropriate mastery or non-mastery state. The observed difficulty for
a given item is dependent upon the observed characteristics of the
actual item selected and the domain score under consideration. As
various items are sampled and/or different domain scores considered this
probability changes. If each item was chosen completely at random from a
domain of all potential items and if a given individual had an equal
probability of correctly responding to each item in the domain then
domain scores derived from an assumed population of items (as in the
binomial model) could be used as the probability that one item would
correctly classify an individual into mastery or non-mastery categories.
Since these two assumptions are seldom met in the classroom situation it
would be preferable to use the actual item parameters for establishing
these baseline probabilities.
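A minimal sketch of how a single-item baseline probability might be carried through a binomial expansion to a test of several items follows. It assumes items behave as independent trials with the same probability of a correct response, and that a student passes when the number correct reaches the criterion proportion of the test. The function names are hypothetical, and this is not presented as the authors' program.

```python
import math


def required_correct(n_items: int, criterion: float) -> int:
    """Smallest number correct that meets the criterion proportion."""
    # Subtracting a tiny epsilon avoids float artifacts (0.7 * 10 > 7).
    return max(0, math.ceil(criterion * n_items - 1e-9))


def prob_pass(n_items: int, p_correct: float, criterion: float) -> float:
    """Binomial probability of reaching the criterion on n_items."""
    need = required_correct(n_items, criterion)
    return sum(
        math.comb(n_items, x) * p_correct ** x * (1 - p_correct) ** (n_items - x)
        for x in range(need, n_items + 1)
    )


def prob_misclassified(n_items: int, p_correct: float,
                       criterion: float, is_master: bool) -> float:
    """A master is misclassified by failing, a non-master by passing."""
    passing = prob_pass(n_items, p_correct, criterion)
    return 1 - passing if is_master else passing


# A non-master with a .25 single-item baseline, criterion of 100%.
for k in (1, 4, 8):
    print(k, prob_misclassified(k, 0.25, 1.0, is_master=False))
```

With one item the result equals the baseline probability itself; lengthening the test changes the probability only through the binomial terms at or above the passing score.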

Empirical Determination of Probabilities

Collecting empirical data upon which to base binomial probabilities
of correctly classifying students into mastery or non-mastery categories
is necessary due directly to assumptions underlying the binomial model.
More specifically, these assumptions are that each test item is chosen
randomly from a domain of all potential items and that a given
individual has an equal probability of correctly responding to each item
that might be selected from the domain.

In most cases, a small number of items from the potential domain
are generated to meet a specific purpose in mind. Items are not randomly
selected from a large item domain. Furthermore, selected items will tend
to vary in difficulty and other characteristics. This is true of even
the best defined domains (e.g. multiplication of all single digit
numbers includes such items as 2 x 2, 7 x 8, and 5 x 6).



Because these two implicit assumptions are not easily met, procedures
based upon an assumed population of items are perhaps not the most
appropriate approach to take for determining mastery test length.
Therefore, it was felt necessary to empirically determine, in a real
situation, baseline probabilities derived from parameters of selected
items rather than using probabilities derived from an assumed population
of items.

Procedure

A computer program was written to carry out the data analysis
procedures described below. For each subject's test score each test item
was scored dichotomously and a total domain score calculated. The first
criterion level was then set and all subjects classified on the basis
of that domain and criterion level into mastery or non-mastery
categories. Criterion levels incorporated into the program included
100%, 90%, 80%, 70%, 60%, and 50%.

To calculate the probability, on a single item, of a subject being
placed in the correct mastery state, given the minimum passing score,
the average difficulty of included items was computed for subjects
performing at each possible domain score. This analysis provided the
baseline data for binomial expansions to determine probabilities of
subject misclassification as test lengths increased.
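The computation described in this paragraph can be sketched as follows. The sketch assumes responses are stored as rows of 0/1 item scores over the full domain, that the domain score is the row total, and that the test is a chosen subset of item positions; the data layout and all names are hypothetical rather than taken from the original program.

```python
from collections import defaultdict


def avg_difficulty_by_domain_score(responses, subset):
    """Mean proportion correct on the subset items, per domain score.

    responses: list of 0/1 lists, one row per subject, full domain.
    subset: indices of the items included in the test.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in responses:
        score = sum(row)                       # domain score for this subject
        totals[score] += sum(row[i] for i in subset) / len(subset)
        counts[score] += 1
    return {score: totals[score] / counts[score] for score in totals}


def single_item_misclassification(avg_difficulty, domain_score,
                                  domain_size, criterion):
    """A master errs by missing the item, a non-master by answering it."""
    is_master = domain_score / domain_size >= criterion
    return 1 - avg_difficulty if is_master else avg_difficulty


# Tiny made-up data set: four subjects, a four-item domain.
data = [[1, 1, 0, 0], [1, 0, 1, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
table = avg_difficulty_by_domain_score(data, subset=[0, 1])
print(table[2])
print(single_item_misclassification(table[2], 2, 4, criterion=1.0))
```

The resulting single-item value is the baseline that would then be entered into a binomial expansion for longer tests.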

At this point test item subsets within the first domain were selected
and analyzed to determine the effects upon probabilities of
misclassifications as actual test size was reduced. Items included in
each subset were not randomly selected, but were selected using a
deliberate plan. This was done to more nearly represent the deliberate
generation or selection of items that occurs when instructors develop
tests with a specific purpose in mind. The size of the item subsets was
reduced by intervals of eight until the final subset for the domain
equalled eight.

For each test subset size the probability of subject
misclassification, given one item, was calculated. Each test subset size
was replicated five times with a mean and standard deviation calculated
from the resulting five probabilities. The average probability of a
misclassification was used as the baseline entry for the binomial
expansions to determine probabilities of subject misclassification as
test lengths are increased.

Once all test size subsets of the first domain had been analyzed
under all six criterion levels a new domain was created by randomly
eliminating eight test items. A total of seven domains, ranging in
size from fifty-six to eight, were investigated. Within each domain
test subsets were again selected and probabilities of subject
misclassification determined under each of the six criterion levels.

Subjects

Probabilities of subject misclassification were empirically
determined based upon data collected throughout the administration of a
56 item test to 1281 subjects; 49% females and 51% males. These subjects
were from 57 volunteer classrooms in the following seven states:
Arkansas - 16%, California - 12%, Kansas - 33%, Maine - 12%, New York -
16%, [illegible], and [illegible] - 4%. Of these 57 classrooms, 29% of
the instructors identified their classrooms as rural, 36% as urban, and
35% as suburban. All subjects were participants in field testing
materials developed by the Individualized Science Instructional System
(ISIS), a project funded by the National Science Foundation, engaged in
the development of discrete instructional modules in a variety of
science topics. 92% of the subjects were taking ISIS in their science
class for the first time, while 7% had ISIS last year.



The grade levels of the participants ranged from nine to twelve;
[illegible] ninth graders, 55% tenth graders, 16% eleventh graders, and
11% twelfth graders. Ages ranged from 1% twelve to thirteen years, 14%
fourteen years, 43% fifteen years, 28% sixteen years, and 15% seventeen
or older. 96% of the subjects indicated that they plan to go on to
college.

In indicating what the classroom teacher perceived the overall
socio-economic level of their class to be in comparison to the nation,
27% of the teachers identified their classes as average, 50% below
average, and 22% above average. In rating the socio-economic level of
their classes in comparison to the rest of their school, 74% identified
their class as average, 22% below average, and 4% above average.

Development of the test instrument

The student test materials consisted of 56 four-response multiple
choice items covering the following defined domain of information: the
common and scientific names of 14 major [illegible] in the body. An item
form approach was used to systematically generate each of the items,
with distractors and item test position within four main test sections
randomly assigned. The domain was selected from content of the ISIS
minicourse Keeping Fit. This domain was selected because it represents
a well-defined and finite area of information which could be completely
sampled. Within this domain, items, even though generated using an item
form approach, would, by content, include items of varying difficulty
levels. Further, the domain was selected because available subjects
would have various levels of attainment within the domain, ensuring a
wide range of scores on the test. In regard to exposure to the contents
of this minicourse, 32% of the subjects indicated that they had done
the minicourse this year, 2% had done an earlier version including
the domain last year, and 65% had never done the minicourse.

Two major approaches were taken to ensure the validity of the 56
test items. First, an item form approach was used to generate all items.
The use of item forms has been identified by various authors such as
Baker (1974) and Hambleton and Novick (1973) as a systematic approach to
establishing content validity. After the actual generation of the
items, content experts were used to determine whether or not the items
for the domain did in fact represent that domain. These content experts
were evaluators and writers from the ISIS project staff.

Test Characteristics

Subjects taking the test represented a heterogeneous group with
respect to their exposure to the contents of the tested domain, and
therefore represented a potential of sampling fairly well the continuum
from total non-mastery to complete mastery. Given the fact that each
item constructed for use in the test was carefully matched to the
intended domain, the consequently observed range of scores gives support
to the heterogeneity [remainder illegible]



                         TABLE 1
                     Test Parameters

No. of Subjects    = 1281
No. of Items       = 56
Mean               = 30.713
Standard Dev.      = 11.851
Reliab. (KR-20)    = .9270

Item Difficulty
    Median         = .5375
    Q3             = .6720

Point Biserial Item Discrimination
    Q1             = .5050
    Median         = .5650
    Q3             = .6585

Table 1 summarizes pertinent data concerning test characteristics
associated with the total set of 56 items. The mean score of 30.713
corresponds to answering approximately 55% of the items correctly. The
internal consistency of the test was found to be quite high as indicated
by the reliability coefficient and the item discrimination values. The
difficulty levels of a majority of items were contained within a rather
narrow range; however, a minority of items did range from very easy to
very difficult.
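For readers who want to reproduce statistics of this kind, the sketch below computes a KR-20 coefficient from dichotomous item scores and converts a mean raw score into a percentage. It uses the standard KR-20 formula with the population variance of total scores, which is an assumption about the variance convention rather than something the paper states; the function names and the small data set are hypothetical.

```python
import statistics


def kr20(responses):
    """Kuder-Richardson formula 20 for rows of 0/1 item scores."""
    n_subjects = len(responses)
    n_items = len(responses[0])
    # Item difficulty p_i: proportion of subjects answering item i correctly.
    p = [sum(row[i] for row in responses) / n_subjects for i in range(n_items)]
    sum_pq = sum(p_i * (1 - p_i) for p_i in p)
    totals = [sum(row) for row in responses]
    var_total = statistics.pvariance(totals)   # population variance (assumed)
    return (n_items / (n_items - 1)) * (1 - sum_pq / var_total)


def mean_percent_correct(mean_score: float, n_items: int) -> float:
    """Mean raw score expressed as a percentage of the items."""
    return 100 * mean_score / n_items


# Made-up responses: four subjects, three items.
print(kr20([[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]))
# Mean and item count as reported in Table 1.
print(round(mean_percent_correct(30.713, 56)))
```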

The statistics used to describe the test often are not appropriate
for domain referenced tests. However, if the range of examinee abilities
is large, variability of test scores should be expected, and reference
procedures for evaluating the statistical qualities of the test
considered acceptable. It is expected that within the constraints of a
more homogeneous group of students, e.g. a single classroom, the
apparent qualities of the test would be quite different. This last point
is discussed later in this paper.



Discussion of Results

Within a well-defined domain, as used in this investigation, as the
actual test size was reduced the average probability of misclassifying a
student remained fairly consistent, while the standard deviation
increased slightly. For example, when determining scores using all 56
items, and calculating average item difficulties using sets of 48 items
and setting the criterion level at 100%, an individual with a domain
score of 14/56 had an average probability of misclassification of .25
with a standard deviation of .01. In comparison, with a domain size of
56, a test size of 8, and a criterion level set at 100%, an individual
with a domain score of 14/56 had an average probability of
misclassification of .26 with a standard deviation of .05. Using another
comparison example, with a domain of 24 items, a test size of 16 items,
and setting the criterion level at 70%, an individual with a score of
14/24 had an average probability of misclassification of .58 with a
standard deviation of .01. In comparison, with a domain size of 24
items, a test size of 8 items, and setting the criterion level at 70%,
an individual with a score of 14/24 had an average probability of
misclassification of .59 with a standard deviation of .02. Using a
well-defined domain, sampling error did not appear to have much of an
effect upon the average probability of misclassification as the actual
test size (i.e. the number of items used to determine the average item
difficulties) was reduced.

As the actual domain size was reduced the average probability of
misclassifying individuals performing at specific domain levels also
remained quite consistent. Table 2 illustrates this by demonstrating
that as the number of items used to determine domain scores was reduced
from 56 to 16, there were only minor changes in the average
probabilities (and associated standard deviations) of misclassifying an
individual at various examinee performance levels within the domain.
This would suggest that for the present content domain, 56 items
provided a fairly stable estimate of baseline probabilities for
classifying individuals into mastery/non-mastery categories at various
domain performance levels.

                              Table 2

            Probabilities of Student Misclassifications
                     for Various Domain Sizes
             Using Criterion Levels of 100% and 50%

                           % correct     average       standard
                           on domain     probability   deviation

Domain Size     = 56           75           .75          .05
Test Size       = 8            50           .52          [illegible]
Criterion Level = 100%         25           .26          .05

Domain Size     = 24           75           .74          .02
Test Size       = 8            50           .51          .02
Criterion Level = 100%         25           .26          .01

Domain Size     = 16           75           .75          .05
Test Size       = 8            50           .50          .05
Criterion Level = 100%         25           .25          .05

Domain Size     = 56           75           .25          [illegible]
Test Size       = 8            50           .48          .04
Criterion Level = 50%          25           .26          [illegible]

Domain Size     = 24           75           .26          .02
Test Size       = 8            50           .49          .02
Criterion Level = 50%          25           .26          .01

Domain Size     = 16           75           .25          [illegible]
Test Size       = 8            50           .50          .05
Criterion Level = 50%          25           .25          .05

Holding the domain and test size constant, as the criterion level was
decreased to .50, the probability of misclassifying non-mastery
individuals into the mastery category increased. Similarly, under the
same conditions, as the criterion level was increased the probability of
misclassifying mastery individuals into the non-mastery category
decreased. For example, with a domain of 56, a test size of 24, and the
criterion level set at 90%, there was a .89 probability that an
individual scoring just below the criterion level had been misclassified
and a .09 probability that an individual scoring above the criterion
level had also been misclassified. Under the same domain and test
conditions, but with the criterion level set at 50%, there was a .50
probability that an individual scoring just below the criterion level
had been misclassified and a .48 probability that an individual scoring
just above the criterion level had also been misclassified.

For the criterion levels incorporated in the present study (.5 to
1.0), individuals with domain scores above the criterion level have the
lowest probability of misclassification; the farther to the right a
score is from the criterion level the lower this probability.
Individuals scoring just below the criterion level have the highest
probability of being misclassified; this probability again decreases as
scores move downward away from the criterion level. These relationships
were most predominant as the criterion level deviated away from and
above 50%. The binomial model would suggest that the relationship of
probabilities above and below the criterion level would be reversed for
criterion levels below 50%. For instance, the probability of
misclassifying individuals whose domain scores are above the criterion
level would be higher than for those which are below the criterion
level.



Application of the Model

The instructor or instructional designer, in applying this model,
should establish values for each of the variables within the model to
best fit the conditions of the assessment system to which the model is
to be applied. In implementing the model, a user must take into
consideration the probable range and distribution pattern of student
domain scores in order to incorporate the appropriate probabilities of
student misclassification. Presented here is an example of how a
designer might use this model to form decisions concerning the optimal
criterion-referenced testing strategies. The illustration is limited to
using the probabilities which were empirically derived from the present
investigation.

An instructor, in allocating time within the total instructional
effort, decided to allow no more than eight minutes for assessing
initial student mastery over each decision point. It was determined
that, for the particular item format to be incorporated, allowing one
minute per item would provide sufficient time for students to complete
the respective tests. From previous experience, it was estimated that on
an initial test, students' scores would be rather symmetrically
distributed between upper and lower limits of answering 95% to 50% of
the test items correctly; more students would be expected to perform at
the center of this distribution than at the extremes. It was decided to
set the criterion score level at .80.

Using the relationship defined earlier between the number of items
assigned to a decision point and other variables included in the model,
it is determined that the required response latency allows eight items
to be used to assess each decision point. Within the estimated range of
domain scores, given eight items assigned to each decision point, the
probabilities of classifying students into the correct mastery states
range from [illegible] to .53. Incorporating the subset of
misclassification probabilities, additional weights are assigned to
probabilities associated with the center of the domain score
distribution. Incorporating the subset of probabilities corresponding to
the use of eight items administered to individuals with domain scores
ranging from 50% to 95%, the probabilities of misclassifications listed
in Table 3 would be appropriate. These probabilities are extracted from
the larger set of probabilities derived empirically in conjunction with
the present investigation. (Reproduction limitations prohibit
reproduction of the complete tables; however, specific sections can be
provided to individuals upon request.)

Weighting the probabilities to parallel the anticipated distribution
of domain scores, the average probability of misclassifying an examinee
would be .26 for the conditions described. In other words, approximately
one out of four students would be classified into the incorrect mastery
state.
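The weighting step can be illustrated with a short sketch. The weights and probabilities below are invented for illustration only and are not the values from Table 3; the function name is likewise hypothetical.

```python
def weighted_misclassification(weights, probabilities):
    """Average misclassification probability, weighted by how many
    students are expected at each domain score."""
    if len(weights) != len(probabilities):
        raise ValueError("weights and probabilities must align")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * p for w, p in zip(weights, probabilities)) / total_weight


# Hypothetical symmetric weights, heavier at the center of the
# anticipated score distribution, with made-up probabilities.
weights = [1, 2, 3, 2, 1]
probabilities = [0.10, 0.20, 0.30, 0.20, 0.10]
print(weighted_misclassification(weights, probabilities))
```

Because the weights are normalized inside the function, only their relative sizes matter, which matches the idea of a relative weight that is larger at the center of the distribution than at the extremes.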

- If this ai30unt of oisclassification vras considered unacceptable,' the 
instructor or -instructional designer could alter values given ^ the 
other v^ables Incorporated into the model. The nur&er of te^ -itenis^ 
used to assess each decision point could be increased. If the model _ 
indicated that as a consequence, tiie aniount of tine required for tesf^ 
ing %*as e>^essjve, the«s^yerage breadth of pontent assigned to each 

• * • 

decisil>B„point' could be reduced. One could also modify the criterion 
level, or alter the effectiveness of instructiSn preceding testing in 
order to change the anticipated distribution of domain scores. The 
iaost important contribution to be made by the model is that it provides 
a Beans of interrelating the consequences associated with a proposed 

. ' 18 . 



tft ProbabilttJes of-4«5CTass1ficatic^ 

^ Specific Oouain 

^ Scores Ranging Fran 50? to S5S 

27 1 *5!s 

28 ' ? ' 

33 . ^1 

34 .2 -H 
25 3 
36 



18 

i .18 



W - 3 

38 



39 
40 

41 , 5 

42 . 
43 
44 
45 
46 
47 
48 
49 



.24. 



4 ,25 

5 .29 

.25 



4 .38 

4 .41 

3 ' .4! 

3 .-42 

2 .34 

2 - .25 

2- • . .23 
2 • .18 

t^^ 1 .16 



(a) Domain scores are the number of items out of [illegible] that would
be correctly answered.

(b) Relative weight simply reflects a symmetrical distribution which is
more concentrated at the center than at the extremes.

(c) Probabilities listed are those determined from a domain of
[illegible] using test sizes of 8 items.




assessment strategy. Then, on the basis of importance of the decision
point and potential time allocated, the instructor can consider various
combinations of test sizes, criterion levels and actual probabilities of
misclassifications. Once various combinations have been looked at, the
instructor can analyze present needs and select the best combination of
values to be assigned the various components in the model.

Limitations of the Model

Three weaknesses in the present model have been identified. First,
there is the matter of allocating an appropriate amount of time within
a mastery-testing strategy to those students who fail to surpass the
criterion performance on the initial test attempt. In the situation
where initial testing occurs in the classroom and retesting is
administered other than during class time (such as in a testing center
accessible at the student's discretion and need), time allocation is
not a problem. However, consideration must be given to time allocations
for retesting when these re-evaluations must occur during class time. At
present this allocation of time is left up to the user of the model,
as it is a judgment which must be estimated on the basis of previous
knowledge concerning characteristics and individual differences of the
students involved.

A second limitation of the present model is the establishment of
parameters associated with various item forms and formats. Data for the
present investigation were limited to multiple choice items written for
a well-defined domain of information in a science area and administered
to high school science students. As the various types of items used
and the domains are changed, the baseline parameters would need to be
reestablished.




A third limitation of the present model is the lack of convenient
procedures for determining domain score distributions and their
associated probabilities of misclassification for various domains and
item test formats. Similarly, accurate and detailed information about
student performance under these various conditions has not been widely
collected for use in determining these domain score distributions and
associated probabilities. This type of information is, however, easily
obtainable, and computer programs or derived tables could be made
accessible to the instructional designer.

Implications of the Model

This model is unique in its treatment of three major areas. First,
minimal work has been done in the area of determining just how many test
items are needed to classify a person into mastery/non-mastery categories
with respect to a previously stated area with a given level of confidence.
Most significant works in the area eventually rest upon the binomial
model in deriving probabilities of student misclassifications. Since
the average probability with which a student will correctly answer each
item is a function of the specific items utilized or sampled, it would
be better to apply a binomial expansion to the probabilities associated
with the items actually selected rather than to estimates of population
parameters based on supposed domains of equivalent items.
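
A minimal Python sketch of such a binomial expansion follows. It is an illustration under stated assumptions, not the procedure used in the investigation: the function names are hypothetical, the per-item probabilities are invented values, and averaging them into a single probability before expanding is a simplification adopted here.

```python
from math import comb


def prob_pass(n_items, criterion, p_correct):
    """Binomial probability of at least `criterion` correct answers on `n_items`."""
    return sum(
        comb(n_items, k) * p_correct**k * (1 - p_correct) ** (n_items - k)
        for k in range(criterion, n_items + 1)
    )


def prob_misclassified(n_items, criterion, p_correct, is_master):
    """A true master is misclassified by failing, a true nonmaster by passing."""
    p_pass = prob_pass(n_items, criterion, p_correct)
    return 1.0 - p_pass if is_master else p_pass


# Hypothetical per-item probabilities of a correct answer for the 8 items
# actually selected (assumed values, not data from the study).
item_probs = [0.9, 0.8, 0.7, 0.8, 0.9, 0.6, 0.8, 0.9]
p_bar = sum(item_probs) / len(item_probs)  # average over the selected items

# Chance that a true master is failed when 6 of 8 correct is required.
false_fail = prob_misclassified(8, 6, p_bar, is_master=True)
print(f"average item probability: {p_bar:.2f}")
print(f"probability of failing a true master: {false_fail:.3f}")
```

The point of the sketch is that `p_bar` comes from the items on the test form itself rather than from an assumed population of equivalent items.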

A second unique characteristic of the model is the adaptability
to specific types of students. Specifically, as information is collected
on how similar groups perform on a specific type of test, more efficient
decisions can be made in regard to testing the next group. As classes
differ in ability and actual performance, the test length and criterion
level can be adjusted to accommodate desired probabilities of
misclassifications.

The third unique feature of the model is that by using this procedure
the instructional developer or individual instructor is provided a means
by which the interrelationships of various test-related factors become
apparent. Components which previously have been considered independently,
without attention to interrelationships, are incorporated in such
a manner within the model that when one component is altered the user
becomes aware of the resulting implications to changes in the remaining
elements of the model. The model provides the user with a method for
selecting values for some components in the model and determining the
resulting values for the remaining elements. If not satisfied with the
first result, the user can repeatedly go through the model, changing
values until a satisfactory combination is obtained. In this manner, the
user is able to concurrently consider any individual or group of factors
that will affect or be affected by any other decisions made during the
developmental period of instruction and related criterion-referenced
tests used for student assessment.
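
The idea of repeatedly going through the model until a satisfactory combination is found can be sketched as a simple search. The sketch below is hypothetical throughout: the function and parameter names are invented, a plain binomial model stands in for the empirically derived probabilities, and total testing time is assumed to equal items per test times seconds per item times the number of decision points.

```python
from math import comb


def prob_pass(n_items, criterion, p_correct):
    """Binomial probability of at least `criterion` correct answers on `n_items`."""
    return sum(
        comb(n_items, k) * p_correct**k * (1 - p_correct) ** (n_items - k)
        for k in range(criterion, n_items + 1)
    )


def smallest_adequate_test(p_master, p_nonmaster, criterion_percent, max_error,
                           seconds_per_item, decision_points,
                           time_budget_seconds, max_items=60):
    """Return the shortest test meeting the error target within the time
    budget, or None if no such test exists. `criterion_percent` is an integer."""
    for n_items in range(1, max_items + 1):
        # Assumed time model: every decision point uses a test of this length.
        total_seconds = n_items * seconds_per_item * decision_points
        if total_seconds > time_budget_seconds:
            return None  # longer tests would only cost more time
        # Smallest whole number of items that reaches the criterion percentage.
        criterion = (criterion_percent * n_items + 99) // 100
        false_fail = 1.0 - prob_pass(n_items, criterion, p_master)
        false_pass = prob_pass(n_items, criterion, p_nonmaster)
        if max(false_fail, false_pass) <= max_error:
            return {
                "items": n_items,
                "criterion": criterion,
                "false_fail": false_fail,
                "false_pass": false_pass,
                "total_seconds": total_seconds,
            }
    return None


# Hypothetical inputs: masters answer 90% of items correctly, nonmasters 50%.
plan = smallest_adequate_test(
    p_master=0.9, p_nonmaster=0.5, criterion_percent=75, max_error=0.2,
    seconds_per_item=60, decision_points=10, time_budget_seconds=36000,
)
print(plan)
```

Changing any one input (the criterion, the time budget, the tolerated error) and rerunning shows how the remaining values must move, which is the kind of interrelationship the model is meant to expose.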


Directions for Future Research

In regard to establishing probabilities of student classifications,
two major areas require further investigation: (1) the effect of various
types of item formats, and (2) the effect of other content areas or
domains. The effect of using different types of item formats, other than
simple multiple choice, for the establishment of baseline probabilities
needs to be determined. Similarly, the establishment of these
probabilities as different content areas and less well-defined domains
are used warrants further investigation. As to student time required to
respond, most information about response latency time for various item
formats is in the area of estimates. Specific data in regard to
well-described types of items and students should be collected to aid
the instructional designer or instructor in allocating testing time.
Finally, a means by which the model could be easily manipulated by the
instructional designer or instructor requires refinement.

As to the actual model, the authors welcome and encourage people to
take exception to it. The components selected for inclusion in the model
and the manner in which they were integrated represent only one of
various possible options. A major advantage to this model is that it
provides a way of integrating components to encourage and facilitate
the instructional designer or instructor to concurrently consider effects
that any decision about one component has upon the remaining elements.

The consequence of not concurrently considering components within
the model has resulted in decisions being made about various components
with no concern for the effect these decisions have on other elements.
Of particular significance is that the instructor and designer quite
freely and arbitrarily select the number of items necessary to assess
student competence over an area without questioning the probability of
an individual being placed into the correct or incorrect mastery state.
Further, the instructor or designer is left at the point where guesses
are being made in such critical areas as to how many items should be
included on a criterion-referenced test, what passing level should be
used, and how correct student classifications really are. The
development and application of the presently proposed model, or some
other comprehensive model which allows the user to identify and relate
specific components which affect the optimal construction and
implementation strategies of criterion-referenced tests, is essential.


In this investigation a comprehensive model was proposed which
would facilitate the instructional designer or individual instructor in
concurrently considering various components which affect the construction
of criterion-referenced tests used to make pass/no-pass decisions in
regard to specified decision points. Components within the model
include the average response latency time associated with the specific
item format, the total amount of student time to be allocated to testing,
the number of items selected for determining pass/no-pass decisions on
each decision point, the number of mastery-status decisions made within
the course, and the probability for individuals performing at a
specified true domain score of being placed in the correct mastery state.

In order to demonstrate the empirical determination of probabilities
of correctly classifying individuals into mastery/non-mastery
categories, a 55 item test covering a well-defined domain of science
information was administered to [illegible] high school students. Baseline
probabilities were determined under various domain and test sizes, using
six different criterion levels in each case. Binomial expansions were
then used to determine probabilities as test lengths were increased.

The data collected suggested that when a well-defined domain is
established, as the actual domain size was reduced the average
probability of misclassifying individuals at specific domain levels
remained fairly consistent. Further, as the actual test size was reduced
the baseline probability of misclassifying an individual at a given
domain level also remained fairly consistent, while the standard deviation
increased slightly. Increasing the criterion level resulted in an
increase in the probability of misclassifying individuals with domain
scores above the criterion level, while decreasing the criterion level
resulted in an increase in the probability of misclassifying individuals
with domain scores below that criterion level.
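
The direction of the criterion-level effect can be illustrated with a short Python sketch. The findings above are empirical; the sketch merely assumes a binomial model and two hypothetical examinees (probabilities of .95 and .50 of answering any item correctly) on an 8-item test, so it shows the direction of the effect rather than the reported values.

```python
from math import comb


def prob_pass(n_items, criterion, p_correct):
    """Binomial probability of at least `criterion` correct answers on `n_items`."""
    return sum(
        comb(n_items, k) * p_correct**k * (1 - p_correct) ** (n_items - k)
        for k in range(criterion, n_items + 1)
    )


N_ITEMS = 8
P_ABOVE = 0.95  # hypothetical examinee above every criterion compared here
P_BELOW = 0.50  # hypothetical examinee below every criterion compared here

# Raising the criterion from 6 to 7 correct: more true masters are failed.
fail_at_6 = 1.0 - prob_pass(N_ITEMS, 6, P_ABOVE)
fail_at_7 = 1.0 - prob_pass(N_ITEMS, 7, P_ABOVE)

# Lowering the criterion from 6 to 5 correct: more true nonmasters are passed.
pass_at_6 = prob_pass(N_ITEMS, 6, P_BELOW)
pass_at_5 = prob_pass(N_ITEMS, 5, P_BELOW)

print(f"master failed:     criterion 6 -> {fail_at_6:.3f}, criterion 7 -> {fail_at_7:.3f}")
print(f"nonmaster passed:  criterion 6 -> {pass_at_6:.3f}, criterion 5 -> {pass_at_5:.3f}")
```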


Application of the model demonstrated how the designer or instructor
could manipulate components within the model in order to select the most
efficient combination of factors to meet present needs. It was also
demonstrated how this model could be used to make testing decisions for
specific types of students based upon estimates of student performance
with the content domain. The need for further research in the area of
other domains and various types of item formats was pointed out. Most
importantly, the need for further work in developing comprehensive
models which provide the designer or instructor with easy methods for
concurrently considering various test related components has been
identified.






